Matthew Talluto
21.02.2022
We will use the two dominant graphics packages during this course.
Here we will work through both, using the histogram as a motivating example.
histhistThe defaults are not very nice, so lets improve things
main = "" disables the titlexlab and ylab control axis labelsbreaks controls the number of bins in the histogramcol sets the color of the barsborder sets the color of the borders (NA: no border)#RRGGBB
00 (none) to FF (most)col = rosybrowncolors() function for the nameshead(colors())
## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"ggplot2 is a package: packages extend the base functionality of Rinstall.packages() (just once)library (once per session)ggplot always works on data.frames, so make sure your data is in oneaes
x position, and set a histogram geometry# only do this once, if you do not have the package already installed
# install.packages("gglot2")
# load the package
library("ggplot2")
# make a data frame with a single variable named my_var
my_df = data.frame(my_var)
# create a plot and save it in a variable
x_histogram = ggplot(my_df, aes(x = my_var)) + geom_histogram()
# add some more features
x_histogram = x_histogram + ylab("Frequency") + xlab("Variable of Interest")
# display the plot
print(x_histogram)The population mean (\(\mu\)) can be approximated with the sample mean:
\[ \mu \approx \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \]
The mean can be strongly influenced by outliers.
The mean can be strongly influenced by outliers.
# mode
mode(my_var) # wrong!
## [1] "numeric"
# we can use the histogram function to approximate the sample mode
# changing the number of breaks will have a large impact on the results
my_hist = hist(my_var, breaks = 30, plot = FALSE)
# the mids variable gives you the midpoint of each bin
# counts gives you the count each bin
# cbind shows them together in columns
head(cbind(my_hist$mids, my_hist$counts))
## [,1] [,2]
## [1,] 3 35
## [2,] 5 107
## [3,] 7 161
## [4,] 9 175
## [5,] 11 135
## [6,] 13 130
# use this to get the mode
# first find out which one is the tallest with which.max
(mode_position = which.max(my_hist$counts))
## [1] 4
my_hist$mids[mode_position]
## [1] 9We can compare variables in a way that is location independent by centering (subtracting the mean)
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i-\mu)^2 \]
We can estimate \(\sigma^2\) using the sample variance:
\[ \sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]
It is convenient to talk about the scale of \(x\) in the same units as \(x\) itself, so we use the (population or sample) standard deviation:
\[ \sigma = \sqrt{\sigma^2} \approx s = \sqrt{s^2} \]
range: max(x) - min(x)
interquartile range (IQR): the difference between the third and first quartiles
coefficient of variation (CV): \[ \frac{s}{|\bar{x}|} \]
A boxplot is a nice way to summarize location and dispersion in a dataset
# the range function gives you the min and max
# Take the difference to get the statistical range
diff(range(my_var))
## [1] 50.4
max(my_var) - min(my_var)
## [1] 50.4
IQR(my_var)
## [1] 6.69
quantile(my_var, 0.75) - quantile(my_var, 0.25)
## 75%
## 6.69
# coefficient of variation can be done manually
sd(my_var)/mean(my_var)
## [1] 0.531
ggplot(my_df, aes(y = my_var)) + geom_boxplot() +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())Is the distribution weighted to one side or the other?
\[ \mathrm{skewness} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^3}{(n-1)s^3} \]
How fat are the tails relative to a normal distribution?
\[ \mathrm{kurtosis} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{(n-1)s^4} \]
boxplot functiony ~ group is a special data type called a formulaboxplot functiony ~ group is a special data type called a formulairis datasetspecies, is nominal and thus the mean is not definedmean separately to every variable in irissimple apply (sapply)# bad!
(col_means = c(
mean(iris[,1]),
mean(iris[,2]),
mean(iris[,3]),
mean(iris[,1]), # it's easy to introduce mistakes this way!
mean(iris[,5])))
## Warning in mean.default(iris[, 5]): argument is not numeric or logical:
## returning NA
## [1] 5.84 3.06 3.76 5.84 NA
# better
(col_means = sapply(iris, mean)) # apply the function mean to each variable in iris
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5.84 3.06 3.76 1.20 NAmean separately to every variable in irissimple apply (sapply)# we use only the first four columns, because these functions throw an error with non-numeric data
sapply(iris[,1:4], range)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 4.3 2.0 1.0 0.1
## [2,] 7.9 4.4 6.9 2.5
sapply(iris[,1:4], quantile, c(0.25, 0.5, 0.75))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 25% 5.1 2.8 1.60 0.3
## 50% 5.8 3.0 4.35 1.3
## 75% 6.4 3.3 5.10 1.8tapply: tabular apply
Sepal.Length) based on categories in another variable (e.g., Species)Thought experiment
You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.
Thought experiment
You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.
Each person flips a fair coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).
Question: What is the long-run distribution of positions on the field?
Thought experiment
You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.
Each person flips a fair coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).
Question: What is the long-run distribution of positions on the field?
t = 10
Thought experiment
You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.
Each person flips a fair coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).
Question: What is the long-run distribution of positions on the field?
t = 50
Thought experiment
You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0m, the western boundary -50 m, and the eastern boundary as +50 m.
Each person flips a fair coin. If the coin is heads, they take a step east (add 0.5 m to their location), if its tails, they take a step west (subtract 0.5 m from their location).
Question: What is the long-run distribution of positions on the field?
t = 500
We can simulate this experiment in R. Here is code for doing it for one person:
We can simulate this experiment in R.
And for a larger sample (2500) friends, doing more coin flips (500 each), with more concise code.
# repeat the initial position 2500 times, one per friend
n_friends = 2500
positions = rep(0, n_friends)
# repeat the experiment 500 times
time_steps = 1:500
for(i in time_steps) {
# flip a coin for each friend; this returns one heads or tails for each n_friends
coin_flips = sample(c("heads", "tails"), size = n_friends, replace = TRUE)
# compute the steps: +0.5 if the flip is heads, -0.5 if tails
moves = ifelse(coin_flips == "heads", 0.5, -0.5)
# finally add the moves to the old positions and save the new position
positions = positions + moves
}density function in R to add a curve approximating this density.hist(positions, breaks=40, col="gray", main = "", freq=FALSE)
lines(density(positions, adjust=1.5), col='red', lwd=2)
mu = mean(positions)
sig = sd(positions)
x_norm = seq(min(positions), max(positions), length.out = 400)
y_norm = dnorm(x_norm, mu, sig)
lines(x_norm, y_norm, lwd=2, col='blue')
legend("topright", legend=c(paste("sample mean =", round(mu, 2)),
paste("sample sd =", round(sig, 2))), lwd=0, bty='n')scale function.scale function.\[ \mathcal{f}(x) = \frac{1}{\sigma \sqrt{2\pi}} \mathcal{e}^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} \]
\[ \mathcal{g}(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}} \mathcal{e}^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} dx \]
PDF: what is the probability density when \(x=3\) (the height of the bell curve)
CDF: what is the cumulative probability when \(x=q\)
(area under the bell curve from \(-\infty\) to \(q\))
(probability of observing a value < \(q\))
Quantiles: what is the value of \(x\), such that the probability of observing x or smaller is \(p\)
(inverse of the CDF)
RNG: Random number generator, produces \(n\) random numbers from the desired distribution
rnorm(n = 10, mean= 0, sd = 1)
## [1] 0.606 -1.230 -1.041 0.362 1.731 -0.855 -0.579 1.140 0.852 0.711R supports many distributions, we will discuss others as they come up.
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} \]